statistical thinking
Computational Thinking in the Era of Data Science
Recent years have seen the integration of computer science, mathematics, and statistics, together with real-world domain knowledge, into a new research and applications field: data science.4 Just as data science integrates knowledge and skills from computer science, statistics, and a real-world application domain, data thinking, we propose, integrates computational thinking, statistical thinking, and domain thinking. Computational thinking was first introduced by Papert13 and, a quarter of a century later, was illuminated and elaborated on by Wing.15 As it turns out, exploring the novelty of data thinking uncovers new facets of computational thinking. In this Viewpoint, we first present our interpretation of the concept of data thinking. Then, based on insights gained from that discussion, we propose that a timely need has emerged to introduce data thinking into computer science education alongside computational thinking, in the context of various real-world domains and using real-life data.
Statistical Thinking for the 21st Century
The goal of this book is to tell the story of statistics as it is used today by researchers around the world. It's a different story than the one told in most introductory statistics books, which focus on teaching how to use a set of tools to achieve very specific goals. This book focuses on understanding the basic ideas of statistical thinking -- a systematic way of thinking about how we describe the world and use data to make decisions and predictions, all in the context of the inherent uncertainty that exists in the real world. It also brings to bear current methods that have only become feasible in light of the amazing increases in computational power that have happened in the last few decades. Analyses that would have taken years in the 1950s can now be completed in a few seconds on a standard laptop computer, and this power unleashes the ability to use computer simulation to ask questions in new and powerful ways.
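As a rough illustration of what "using computer simulation to ask questions" can look like, here is a minimal sketch of a percentile bootstrap for the mean, written with only the Python standard library. It is not taken from the book: the function name `bootstrap_mean_ci`, its parameters, and the sample values are all made up for this example.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    # Resample the data with replacement many times and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the sorted resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up measurements, used only to exercise the function.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9]
low, high = bootstrap_mean_ci(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"approximate 95% interval for the mean: ({low:.2f}, {high:.2f})")
```

The point of the sketch is that the interval comes from repeated resampling rather than from a closed-form formula, which is the kind of analysis that cheap computation makes routine.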
From Data-Driven to Data Science-Driven
In 2010, while still in grad school finishing up my degree, I launched a small consulting company selling my data analytics skills. The motto on my website was "Data Driven." Original, I know, but I was attempting to capture a trend coming out of the '90s. Back in the '90s and early 2000s, the idea that businesses should be more "data driven" was catching like wildfire. As evidence mounted that creating a culture of data-driven decision making was helping businesses to outperform their less data-savvy competition, more and more enterprises began to require their leaders to be data literate. Gartner and Forrester started to develop metrics evaluating just how "data driven" companies were.
The Essential Data Science Venn Diagram
A few years ago, Drew Conway came up with and shared his now-ubiquitous Data Science Venn Diagram. It was helpful, and we all were enlightened. A number of variants followed, and I am here suggesting my own enhancements. Hopefully they will be enlightening as well. I am proposing two additions: the differentiating of statistical applications (multivariate vs. non-multivariate) and the addition of discipline essences (i.e., the main contribution or function of each skill set).
Is Medicine Mesmerized by Machine Learning? Statistical Thinking
BD Horne et al. wrote an important paper, "Exceptional mortality prediction by risk scores from common laboratory tests," that apparently garnered little attention, perhaps because it used older technology: standard clinical lab tests and logistic regression. Yet even putting themselves at a significant predictive disadvantage by binning all the continuous lab values into fifths, the authors were able to achieve a validated c-index (AUROC) of 0.87 in predicting death within 30 days in a mixed inpatient, outpatient, and emergency department patient population. Their model also predicted 1-year and 5-year mortality very well, and performed well in a completely independent NHANES cohort1. It also performed very well when evaluated just in outpatients, a group with very low mortality. The above model, called by the authors the Intermountain Risk Score, used the following predictors: age, sex, hematocrit, hemoglobin, red cell distribution width, mean corpuscular volume, red blood cell count, platelet count, mean platelet volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, total white blood count, sodium, potassium, chloride, bicarbonate, calcium, glucose, creatinine, and BUN2.
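To make the two technical terms in this passage concrete, the sketch below shows one way to compute a c-index (AUROC) for a binary outcome and one way to bin a continuous value into fifths. It is not the authors' code or model: the function names `c_index` and `bin_into_fifths` are hypothetical, the data are made up, and the quintile cut points use the default method of Python's `statistics.quantiles`, which may differ from the binning used in the paper.

```python
from bisect import bisect_right
import statistics


def c_index(scores, labels):
    """Concordance index for a binary outcome (equals the AUROC).

    It is the fraction of (event, non-event) pairs in which the event
    received the higher predicted risk; tied scores count as one half.
    """
    concordant = 0.0
    pairs = 0
    for s_pos, y_pos in zip(scores, labels):
        if y_pos != 1:
            continue
        for s_neg, y_neg in zip(scores, labels):
            if y_neg != 0:
                continue
            pairs += 1
            if s_pos > s_neg:
                concordant += 1.0
            elif s_pos == s_neg:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs


def bin_into_fifths(values):
    """Map each value to a quintile index 0..4 (hypothetical helper)."""
    cuts = statistics.quantiles(values, n=5)  # four cut points
    return [bisect_right(cuts, v) for v in values]


# Made-up predicted risks and outcomes (1 = died, 0 = survived).
risk = [0.1, 0.4, 0.35, 0.8]
died = [0, 0, 1, 1]
print(c_index(risk, died))  # 0.75: three of the four pairs are concordant

# Made-up lab values; binning keeps only which fifth each one falls in.
lab_values = list(range(1, 11))
print(bin_into_fifths(lab_values))
```

Binning discards the ordering of values within each fifth, which is why the passage describes it as a predictive disadvantage for the model.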
40 Python Statistics For Data Science Resources
For an introduction to statistics, this tutorial with real-life examples is the way to go. The notebooks of this tutorial will introduce you to concepts like mean, median, standard deviation, and the basics of topics such as hypothesis testing and probability distributions. A fine way to start your stats learning, since it is inspired by the books "Think Bayes" and "Think Stats", which are two top recommendations that will come back below! If you're looking for books, you can try out this free book on computational statistics in Python, which not only contains an introduction to programming with Python, but also treats topics such as Markov Chain Monte Carlo, the Expectation-Maximization (EM) algorithm, resampling methods, and much more. Or you can buy this book by Thomas Haslwanter for a general introduction to common statistical tests, linear regression analysis and topics from survival analysis and Bayesian statistics. Note that this book does take life and medical sciences as an application area. Both of the above books already introduce you to more advanced statistics topics with Python too, as you can see. If you're a fan of videos, you should consider watching this tutorial on statistical data analysis with SciPy with Christopher Fonnesbeck, an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine.
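For a taste of the concepts these resources cover, here is a small sketch that computes a mean, median, and standard deviation, then runs a simple hypothesis test by permutation. It uses only the Python standard library rather than any of the packages or tutorials listed; the function name `permutation_test` and both data sets are made up for illustration.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an approximate p-value. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (count + 1) / (n_permutations + 1)


# Made-up measurements for two groups.
group_a = [12, 15, 11, 14, 13]
group_b = [18, 17, 19, 16, 20]

print("mean:", statistics.mean(group_a))
print("median:", statistics.median(group_a))
print("sample standard deviation:", round(statistics.stdev(group_a), 3))

p_value = permutation_test(group_a, group_b)
print("permutation p-value:", round(p_value, 4))
```

A small p-value here means that a gap in means as large as the observed one rarely appears when group labels are assigned at random.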
Statistical Thinking is not science fiction, but a data science necessity
Over 100 years ago, the great science fiction writer H. G. Wells was credited with saying, "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read or write." It is clear that this statement is probably more true today than ever, as Big Data and Analytics are paraded before every aspect of life, business, government, and social media experience. Statistical thinking is the bedrock of data science as statistics is a core methodology for many disciplines, including experimental science, operations research, decision sciences, and marketing research. Yet many appear to have forgotten this (or maybe have let it "slip their mind") -- see the recent article by the American Statistical Association (ASA) President, Dr. Marie Davidian: "Aren't We Data Science?" As we read this, we need to remember also that Data Science includes several core methodologies (disciplines): machine learning (data mining), visualization, data management (including data structures, indexing, modeling, taxonomies), applied mathematics, semantics (ontologies), and application-specific discipline science, as well as the original core "data science" of statistics!